Sparkify Churn

Measuring Churn

Feature Engineering (Metric Design)

Machine Learning (Forecasting / Predicting)

Data Wrangling

Measuring Churn

Feature Engineering (Metric Design)

Feature: Tenure Days Check

Feature: UserAgent Check

Feature: Location Check

Feature: Gender Check

Feature: Level Check

Features: All kinds of events.

Create the analytic dataset and add metrics

n_session

n_item

total_length

n_play

n_like

n_dislike

n_addtolist

n_addfriend

n_artist

n_home

n_tenure

ratio_length_per_session

ratio_addtolist_per_play

ratio_like_per_play

ratio_dislike_per_play
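
The ratio metrics above can be sketched as simple quotients of the count metrics. This is a minimal illustration in plain Python, not the notebook's own code: the exact definitions are assumed from the metric names (for example, ratio_like_per_play is taken to be n_like / n_play), and safe_ratio and add_ratio_metrics are hypothetical helper names.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def add_ratio_metrics(counts):
    """Return a copy of one user's count metrics with the ratio metrics added.

    Assumption: each ratio is the quotient suggested by its name.
    """
    metrics = dict(counts)
    metrics["ratio_length_per_session"] = safe_ratio(
        counts["total_length"], counts["n_session"])
    metrics["ratio_addtolist_per_play"] = safe_ratio(
        counts["n_addtolist"], counts["n_play"])
    metrics["ratio_like_per_play"] = safe_ratio(
        counts["n_like"], counts["n_play"])
    metrics["ratio_dislike_per_play"] = safe_ratio(
        counts["n_dislike"], counts["n_play"])
    return metrics


# Made-up counts for a single user, for illustration only.
example_user = {
    "n_session": 4, "total_length": 3600.0, "n_play": 20,
    "n_like": 5, "n_dislike": 1, "n_addtolist": 2,
}
example_metrics = add_ratio_metrics(example_user)
```

Returning 0.0 for a zero denominator keeps users with no plays or sessions in the dataset; treating those ratios as missing instead would be an equally reasonable choice.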

Change measurement on primary metrics (n_play, total_length)

change_perc_n_play

change_perc_total_length
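
A percent-change metric can be sketched as the relative difference between two measurement windows. The outline does not say which windows are compared, so the "previous" and "current" periods below are an assumption, as are the helper name percent_change and the sample numbers.

```python
def percent_change(previous, current):
    """Relative change from previous to current, or None if previous is 0."""
    if previous == 0:
        return None  # undefined: there is no baseline to compare against
    return (current - previous) / previous


# Hypothetical per-window totals for one user.
n_play_by_period = {"previous": 40, "current": 30}
total_length_by_period = {"previous": 9000.0, "current": 9900.0}

change_perc_n_play = percent_change(
    n_play_by_period["previous"], n_play_by_period["current"])
change_perc_total_length = percent_change(
    total_length_by_period["previous"], total_length_by_period["current"])
```

In this example the user played 25% fewer songs while total listening time rose by 10%.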

Flatten the metrics dataset
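
Flattening can be sketched as a pivot from long form (one row per user and metric) to wide form (one row per user). That the metrics dataset is stored in long form is an assumption, and flatten_metrics and the sample rows are hypothetical.

```python
from collections import defaultdict


def flatten_metrics(long_rows):
    """Pivot (user_id, metric_name, value) rows into one dict per user."""
    wide = defaultdict(dict)
    for user_id, metric_name, value in long_rows:
        wide[user_id][metric_name] = value
    return dict(wide)


# Made-up long-form rows: (user_id, metric_name, value).
long_rows = [
    ("u1", "n_play", 20), ("u1", "n_session", 4),
    ("u2", "n_play", 3), ("u2", "n_session", 1),
]
flat = flatten_metrics(long_rows)
```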

Clean the flattened analytic dataset

Attach the outcome label 'is_churn' to the flattened analytic dataset

Metric Cohort Analysis

Analyze how churn depends on the value of metrics
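
One common way to run a metric cohort analysis is to sort users by a metric, split them into equal-size cohorts, and compare the churn rate across cohorts. The sketch below assumes that approach. The function name metric_cohorts, the default of 10 cohorts, and the sample data are illustrative assumptions, not taken from the notebook.

```python
def metric_cohorts(values, is_churn, n_cohorts=10):
    """Group users into equal-size cohorts ordered by metric value.

    Returns a list of (mean metric value, churn rate) tuples, one per
    cohort, from the lowest-value cohort to the highest.
    """
    pairs = sorted(zip(values, is_churn), key=lambda pair: pair[0])
    n = len(pairs)
    cohorts = []
    for i in range(n_cohorts):
        chunk = pairs[i * n // n_cohorts:(i + 1) * n // n_cohorts]
        if not chunk:
            continue  # fewer users than cohorts
        mean_value = sum(v for v, _ in chunk) / len(chunk)
        churn_rate = sum(c for _, c in chunk) / len(chunk)
        cohorts.append((mean_value, churn_rate))
    return cohorts


# Made-up data: eight users, one metric, and a 0/1 is_churn label.
n_play = [1, 2, 3, 4, 5, 6, 7, 8]
is_churn = [1, 1, 1, 0, 1, 0, 0, 0]
cohort_stats = metric_cohorts(n_play, is_churn, n_cohorts=4)
```

A churn rate that falls (or rises) steadily from the first cohort to the last suggests the metric is related to churn; a flat profile suggests it is not.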

First run of metric cohort analysis (raw metric statistics, without any normalization).

Skewness Check
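
A skewness check can be sketched with the moment coefficient of skewness, m3 / m2^1.5, where m2 and m3 are the second and third central moments. Which skewness statistic the notebook uses is not stated, so this definition is an assumption, and the sample values are made up.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # all values identical, so there is no asymmetry
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]  # one heavy user stretches the right tail

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)
```

Count metrics such as n_play are often strongly right-skewed, which is why a check like this comes before normalization.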

Drop metrics, if needed, based on the skewness check.

Reselect the metrics with acceptable skewness

Normalize Data
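
The outline does not say which normalization is applied. As one possibility, the sketch below standardizes a metric to z-scores (subtract the mean, divide by the standard deviation); the choice of z-scores, the helper name to_scores, and the sample values are all assumptions.

```python
from statistics import fmean, pstdev


def to_scores(values):
    """Standardize values to zero mean and unit (population) std."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant metric: no spread to scale
    return [(v - mean) / std for v in values]


# Made-up metric values with mean 5 and population std 2.
raw_metric = [2, 4, 4, 4, 5, 5, 7, 9]
scores = to_scores(raw_metric)
```

After standardization every metric is on the same scale, so cohort plots and model coefficients become comparable across metrics.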

Second round of metric cohort analysis, this time with normalized data and the skewed metrics removed.

Final round of metric cohort analysis, after removing another round of features.

I am satisfied with the final metric cohort analysis: each metric shows some degree of relationship to churn.

Now, add categorical data to the dataframe.


Machine Learning

Logistic Regression

The logistic regression results are not very promising, so I will not tune the model further in this case.


XGBoost

Check out the real churn instances in the test set and pred
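
Comparing the real churn instances with the predictions can be sketched as a confusion count focused on the churn class. The labels below are made up, and y_test, y_pred, and churn_confusion are hypothetical names; no actual model output is shown.

```python
def churn_confusion(y_true, y_pred):
    """Count outcomes for the churn class (label 1) and derive precision/recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn,
            "recall": recall, "precision": precision}


# Made-up test labels and predictions (1 = churned).
y_test = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
report = churn_confusion(y_test, y_pred)
```

Because churners are usually a small minority, recall and precision on the churn class tend to say more about a model than overall accuracy does.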